{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "I_i1q1TWG9zH"
      },
      "source": [
        "# Deep Crossentropy method\n",
        "\n",
        "In this section we'll extend your CEM implementation with neural networks! You will train a multi-layer neural network to solve simple continuous state space games. __Please make sure you're done with tabular crossentropy method from the previous notebook.__\n",
        "\n",
        "![img](https://watanimg.elwatannews.com/old_news_images/large/249765_Large_20140709045740_11.jpg)\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "id": "t4CJ1sRyG9zJ"
      },
      "outputs": [],
      "source": [
        "import sys, os\n",
        "if 'google.colab' in sys.modules and not os.path.exists('.setup_complete'):\n",
        "    !wget -q https://raw.githubusercontent.com/yandexdataschool/Practical_RL/master/setup_colab.sh -O- | bash\n",
        "    !touch .setup_complete\n",
        "\n",
        "# This code creates a virtual display to draw game images on.\n",
        "# It will have no effect if your machine has a monitor.\n",
        "if type(os.environ.get(\"DISPLAY\")) is not str or len(os.environ.get(\"DISPLAY\")) == 0:\n",
        "    !bash ../xvfb start\n",
        "    os.environ['DISPLAY'] = ':1'"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "C2xd5vPwPVCb"
      },
      "outputs": [],
      "source": [
        "# Install gymnasium if you didn't\n",
        "!pip install \"gymnasium[toy_text,classic_control]\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "_2zbc7ahG9zK"
      },
      "outputs": [],
      "source": [
        "import gymnasium as gym\n",
        "import numpy as np\n",
        "import matplotlib.pyplot as plt\n",
        "%matplotlib inline\n",
        "\n",
        "# if you see \"<classname> has no attribute .env\", remove .env or update gym\n",
        "env = gym.make(\"CartPole-v0\", render_mode=\"rgb_array\").env\n",
        "\n",
        "env.reset()\n",
        "n_actions = env.action_space.n\n",
        "state_dim = env.observation_space.shape[0]\n",
        "\n",
        "plt.imshow(env.render())\n",
        "print(\"state vector dim =\", state_dim)\n",
        "print(\"n_actions =\", n_actions)\n",
        "\n",
        "env.close()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Z72_alhdG9zK"
      },
      "source": [
        "# Neural Network Policy\n",
        "\n",
        "For this assignment we'll utilize the simplified neural network implementation from __[Scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html)__. Here's what you'll need:\n",
        "\n",
        "* `agent.partial_fit(states, actions)` - make a single training pass over the data. Maximize the probability of :actions: from :states:\n",
        "* `agent.predict_proba(states)` - predict probabilities of all actions, a matrix of shape __[len(states), n_actions]__\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "wLItY4unG9zL"
      },
      "outputs": [],
      "source": [
        "from sklearn.neural_network import MLPClassifier\n",
        "\n",
        "agent = MLPClassifier(\n",
        "    hidden_layer_sizes=(20, 20),\n",
        "    activation=\"tanh\",\n",
        ")\n",
        "\n",
        "# initialize agent to the dimension of state space and number of actions\n",
        "agent.partial_fit([env.reset()[0]] * n_actions, range(n_actions), range(n_actions))\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 45,
      "metadata": {
        "id": "eyFS3oUmG9zL"
      },
      "outputs": [],
      "source": [
        "def generate_session(env, agent, t_max=1000):\n",
        "    \"\"\"\n",
        "    Play a single game using agent neural network.\n",
        "    Terminate when game finishes or after :t_max: steps\n",
        "    \"\"\"\n",
        "    states, actions = [], []\n",
        "    total_reward = 0\n",
        "\n",
        "    s, _ = env.reset()\n",
        "\n",
        "    for t in range(t_max):\n",
        "\n",
        "        # use agent to predict a vector of action probabilities for state :s:\n",
        "        probs = <YOUR CODE>\n",
        "\n",
        "        assert probs.shape == (env.action_space.n,), \"make sure probabilities are a vector (hint: np.reshape)\"\n",
        "\n",
        "        # use the probabilities you predicted to pick an action\n",
        "        # sample proportionally to the probabilities, don't just take the most likely action\n",
        "        a = <YOUR CODE>\n",
        "        # ^-- hint: try np.random.choice\n",
        "\n",
        "        new_s, r, terminated, truncated, _ = env.step(a)\n",
        "\n",
        "        # record sessions like you did before\n",
        "        states.append(s)\n",
        "        actions.append(a)\n",
        "        total_reward += r\n",
        "\n",
        "        s = new_s\n",
        "        if terminated or truncated:\n",
        "            break\n",
        "    return states, actions, total_reward\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "4xgrTCgJG9zL"
      },
      "outputs": [],
      "source": [
        "dummy_states, dummy_actions, dummy_reward = generate_session(env, agent, t_max=5)\n",
        "print(\"states:\", np.stack(dummy_states))\n",
        "print(\"actions:\", dummy_actions)\n",
        "print(\"reward:\", dummy_reward)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "p85lt16qG9zL"
      },
      "source": [
        "### CEM steps\n",
        "Deep CEM uses exactly the same strategy as the regular CEM, so you can copy your function code from previous notebook.\n",
        "\n",
        "The only difference is that now each observation is not a number but a `float32` vector."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 53,
      "metadata": {
        "id": "4On-p7p4G9zL"
      },
      "outputs": [],
      "source": [
        "def select_elites(states_batch, actions_batch, rewards_batch, percentile=50):\n",
        "    \"\"\"\n",
        "    Select states and actions from games that have rewards >= percentile\n",
        "    :param states_batch: list of lists of states, states_batch[session_i][t]\n",
        "    :param actions_batch: list of lists of actions, actions_batch[session_i][t]\n",
        "    :param rewards_batch: list of rewards, rewards_batch[session_i]\n",
        "\n",
        "    :returns: elite_states,elite_actions, both 1D lists of states and respective actions from elite sessions\n",
        "\n",
        "    Please return elite states and actions in their original order\n",
        "    [i.e. sorted by session number and timestep within session]\n",
        "\n",
        "    If you are confused, see examples below. Please don't assume that states are integers\n",
        "    (they will become different later).\n",
        "    \"\"\"\n",
        "\n",
        "    <YOUR CODE: copy-paste your implementation from the previous notebook>\n",
        "\n",
        "    return elite_states, elite_actions\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xc40V4DaG9zM"
      },
      "source": [
        "# Training loop\n",
        "Generate sessions, select N best and fit to those."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 54,
      "metadata": {
        "id": "PPwVKwF7G9zM"
      },
      "outputs": [],
      "source": [
        "from IPython.display import clear_output\n",
        "\n",
        "\n",
        "def show_progress(rewards_batch, log, percentile, reward_range=[-990, +10]):\n",
        "    \"\"\"\n",
        "    A convenience function that displays training progress.\n",
        "    No cool math here, just charts.\n",
        "    \"\"\"\n",
        "\n",
        "    mean_reward = np.mean(rewards_batch)\n",
        "    threshold = np.percentile(rewards_batch, percentile)\n",
        "    log.append([mean_reward, threshold])\n",
        "\n",
        "    clear_output(True)\n",
        "    print(\"mean reward = %.3f, threshold=%.3f\" % (mean_reward, threshold))\n",
        "    plt.figure(figsize=[8, 4])\n",
        "    plt.subplot(1, 2, 1)\n",
        "    plt.plot(list(zip(*log))[0], label=\"Mean rewards\")\n",
        "    plt.plot(list(zip(*log))[1], label=\"Reward thresholds\")\n",
        "    plt.legend()\n",
        "    plt.grid()\n",
        "\n",
        "    plt.subplot(1, 2, 2)\n",
        "    plt.hist(rewards_batch, range=reward_range)\n",
        "    plt.vlines(\n",
        "        [np.percentile(rewards_batch, percentile)],\n",
        "        [0],\n",
        "        [100],\n",
        "        label=\"percentile\",\n",
        "        color=\"red\",\n",
        "    )\n",
        "    plt.legend()\n",
        "    plt.grid()\n",
        "\n",
        "    plt.show()\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "euK7WRQiG9zM"
      },
      "outputs": [],
      "source": [
        "n_sessions = 100\n",
        "percentile = 70\n",
        "log = []\n",
        "\n",
        "for i in range(100):\n",
        "    # generate new sessions\n",
        "    sessions = [ <YOUR CODE: generate a list of n_sessions new sessions> ]\n",
        "\n",
        "    states_batch, actions_batch, rewards_batch = map(np.array, zip(*sessions))\n",
        "\n",
        "    elite_states, elite_actions = <YOUR CODE: select elite actions just like before>\n",
        "\n",
        "    <YOUR CODE: partial_fit agent to predict elite_actions(y) from elite_states(X)>\n",
        "\n",
        "    show_progress(\n",
        "        rewards_batch, log, percentile, reward_range=[0, np.max(rewards_batch)]\n",
        "    )\n",
        "\n",
        "    if np.mean(rewards_batch) > 190:\n",
        "        print(\"You Win! You may stop training now via KeyboardInterrupt.\")\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yeNWKjtsG9zM"
      },
      "source": [
        "# Results"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "RJwsWl4kG9zM"
      },
      "outputs": [],
      "source": [
        "# Record sessions\n",
        "\n",
        "from gymnasium.wrappers import RecordVideo\n",
        "\n",
        "with RecordVideo(\n",
        "    env=gym.make(\"CartPole-v0\", render_mode=\"rgb_array\"),\n",
        "    video_folder=\"./videos\",\n",
        "    episode_trigger=lambda episode_number: True,\n",
        ") as env_monitor:\n",
        "    sessions = [generate_session(env_monitor, agent) for _ in range(100)]\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "kLPXdME7G9zN"
      },
      "outputs": [],
      "source": [
        "# Show video. This may not work in some setups. If it doesn't\n",
        "# work for you, you can download the videos and view them locally.\n",
        "\n",
        "from pathlib import Path\n",
        "from base64 import b64encode\n",
        "from IPython.display import HTML\n",
        "\n",
        "video_paths = sorted([s for s in Path(\"videos\").iterdir() if s.suffix == \".mp4\"])\n",
        "video_path = video_paths[-1]  # You can also try other indices\n",
        "\n",
        "if \"google.colab\" in sys.modules:\n",
        "    # https://stackoverflow.com/a/57378660/1214547\n",
        "    with video_path.open(\"rb\") as fp:\n",
        "        mp4 = fp.read()\n",
        "    data_url = \"data:video/mp4;base64,\" + b64encode(mp4).decode()\n",
        "else:\n",
        "    data_url = str(video_path)\n",
        "\n",
        "HTML(\n",
        "    \"\"\"\n",
        "<video width=\"640\" height=\"480\" controls>\n",
        "  <source src=\"{}\" type=\"video/mp4\">\n",
        "</video>\n",
        "\"\"\".format(\n",
        "        data_url\n",
        "    )\n",
        ")\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6d_3oOQ1G9zN"
      },
      "source": [
        "# Homework part I\n",
        "\n",
        "### Tabular crossentropy method\n",
        "\n",
        "You may have noticed that the taxi problem quickly converges from -100 to a near-optimal score and then descends back into -50/-100. This is in part because the environment has some innate randomness. Namely, the starting points of passenger/driver change from episode to episode.\n",
        "\n",
        "### Tasks\n",
        "- __1.1__ (2 pts) Find out how the algorithm performance changes if you use a different `percentile` and/or `n_sessions`. Provide here some figures so we can see how the hyperparameters influence the performance.\n",
        "- __1.2__ (1 pts) Tune the algorithm to end up with positive average score.\n",
        "\n",
        "It's okay to modify the existing code.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "L88LySiVG9zN"
      },
      "source": [
        "```<Describe what you did here>```"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7LpAJc4rG9zN"
      },
      "source": [
        "# Homework part II\n",
        "\n",
        "### Deep crossentropy method\n",
        "\n",
        "By this moment, you should have got enough score on [CartPole-v0](https://gymnasium.farama.org/environments/classic_control/cart_pole/) to consider it solved (see the link). It's time to try something harder.\n",
        "\n",
        "* if you have any trouble with CartPole-v0 and feel stuck, feel free to ask us or your peers for help.\n",
        "\n",
        "### Tasks\n",
        "\n",
        "* __2.1__ (3 pts) Pick one of environments: `MountainCar-v0` or `LunarLander-v2`.\n",
        "  * For MountainCar, get average reward of __at least -150__\n",
        "  * For LunarLander, get average reward of __at least +50__\n",
        "\n",
        "See the tips section below, it's kinda important.\n",
        "__Note:__ If your agent is below the target score, you'll still get some of the points depending on the result, so don't be afraid to submit it.\n",
        "  \n",
        "  \n",
        "* __2.2__ (up to 6 pts) Devise a way to speed up training against the default version\n",
        "  * Obvious improvement: use [`joblib`](https://joblib.readthedocs.io/en/latest/). However, note that you will probably need to spawn a new environment in each of the workers instead of passing it via pickling. (2 pts)\n",
        "  * Try re-using samples from 3-5 last iterations when computing threshold and training. (2 pts)\n",
        "  * Obtain __-100__ at `MountainCar-v0` or __+200__ at `LunarLander-v2` (2 pts). Feel free to experiment with hyperparameters, architectures, schedules etc.\n",
        "  \n",
        "__Please list what you did in Anytask submission form__. This reduces probability that somebody misses something.\n",
        "  \n",
        "  \n",
        "### Tips\n",
        "* Gymnasium pages: [MountainCar](https://gymnasium.farama.org/environments/classic_control/mountain_car/), [LunarLander](https://gymnasium.farama.org/environments/box2d/lunar_lander/)\n",
        "* Sessions for MountainCar may last for 10k+ ticks. Make sure ```t_max``` param is at least 10k.\n",
        " * Also it may be a good idea to cut rewards via \">\" and not \">=\". If 90% of your sessions get reward of -10k and 10% are better, than if you use percentile 20% as threshold, R >= threshold __fails to cut off bad sessions__ while R > threshold works alright.\n",
        "* _issue with gym_: Some versions of gym limit game time by 200 ticks. This will prevent cem training in most cases. Make sure your agent is able to play for the specified __t_max__, and if it isn't, try `env = gym.make(\"MountainCar-v0\").env` or otherwise get rid of TimeLimit wrapper.\n",
        "* If you use old _swig_ lib for LunarLander-v2, you may get an error. See this [issue](https://github.com/openai/gym/issues/100) for solution.\n",
        "* If it doesn't train, it's a good idea to plot reward distribution and record sessions: they may give you some clue. If they don't, call course staff :)\n",
        "* 20-neuron network is probably not enough, feel free to experiment.\n",
        "\n",
        "You may find the following snippet useful:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "qcjz-nm_G9zN"
      },
      "outputs": [],
      "source": [
        "def visualize_mountain_car(env, agent):\n",
        "    # Compute policy for all possible x and v (with discretization)\n",
        "    xs = np.linspace(env.min_position, env.max_position, 100)\n",
        "    vs = np.linspace(-env.max_speed, env.max_speed, 100)\n",
        "\n",
        "    grid = np.dstack(np.meshgrid(xs, vs[::-1])).transpose(1, 0, 2)\n",
        "    grid_flat = grid.reshape(len(xs) * len(vs), 2)\n",
        "    probs = (\n",
        "        agent.predict_proba(grid_flat).reshape(len(xs), len(vs), 3).transpose(1, 0, 2)\n",
        "    )\n",
        "\n",
        "    # # The above code is equivalent to the following:\n",
        "    # probs = np.empty((len(vs), len(xs), 3))\n",
        "    # for i, v in enumerate(vs[::-1]):\n",
        "    #     for j, x in enumerate(xs):\n",
        "    #         probs[i, j, :] = agent.predict_proba([[x, v]])[0]\n",
        "\n",
        "    # Draw policy\n",
        "    f, ax = plt.subplots(figsize=(7, 7))\n",
        "    ax.imshow(\n",
        "        probs,\n",
        "        extent=(env.min_position, env.max_position, -env.max_speed, env.max_speed),\n",
        "        aspect=\"auto\",\n",
        "    )\n",
        "    ax.set_title(\"Learned policy: red=left, green=nothing, blue=right\")\n",
        "    ax.set_xlabel(\"position (x)\")\n",
        "    ax.set_ylabel(\"velocity (v)\")\n",
        "\n",
        "    # Sample a trajectory and draw it\n",
        "    states, actions, _ = generate_session(env, agent)\n",
        "    states = np.array(states)\n",
        "    ax.plot(states[:, 0], states[:, 1], color=\"white\")\n",
        "\n",
        "    # Draw every 3rd action from the trajectory\n",
        "    for (x, v), a in zip(states[::3], actions[::3]):\n",
        "        if a == 0:\n",
        "            plt.arrow(x, v, -0.1, 0, color=\"white\", head_length=0.02)\n",
        "        elif a == 2:\n",
        "            plt.arrow(x, v, 0.1, 0, color=\"white\", head_length=0.02)\n",
        "\n",
        "\n",
        "with gym.make(\"MountainCar-v0\", render_mode=\"rgb_arrary\").env as env:\n",
        "    visualize_mountain_car(env, agent)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Dzk41lDPG9zO"
      },
      "source": [
        "### Bonus tasks\n",
        "\n",
        "* __2.3 bonus__ (2 pts) Try to find a network architecture and training params that solve __both__ environments above (_Points depend on implementation. If you attempted this task, please mention it in Anytask submission._)\n",
        "\n",
        "* __2.4 bonus__ (4 pts) Solve continuous action space task with `MLPRegressor` or similar.\n",
        "  * Since your agent only predicts the \"expected\" action, you will have to add noise to ensure exploration.\n",
        "  * Choose one of [MountainCarContinuous-v0](https://gymnasium.farama.org/environments/classic_control/mountain_car_continuous/) (90+ pts to solve), [LunarLanderContinuous-v2](https://gymnasium.farama.org/environments/box2d/lunar_lander/) (`env = gym.make(\"LunarLander-v2\", continuous=True)`)(200+ pts to solve)\n",
        "  * 4 points for solving. Slightly less for getting some results below solution threshold. Note that discrete and continuous environments may have slightly different rules, aside from action spaces."
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8.5"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
